feat: provide Request instances in skipped request callbacks#1927
feat: provide Request instances in skipped request callbacks#1927lorenz-lb wants to merge 1 commit into
Conversation
| ErrorHandler = Callable[[TCrawlingContext, Exception], Awaitable[Request | None]] | ||
| FailedRequestHandler = Callable[[TCrawlingContext, Exception], Awaitable[None]] | ||
| SkippedRequestCallback = Callable[[str, SkippedReason], Awaitable[None]] | ||
| SkippedRequestCallback = Callable[[Request, SkippedReason], Awaitable[None]] |
There was a problem hiding this comment.
This is a breaking change to a public API. Existing callbacks like async def cb(url: str, reason) will now receive a Request; string usage like url.startswith(...) will crash. It should work with both str and Request.
There was a problem hiding this comment.
I don't think you can start sending Request instances into existing callbacks that expect str without breaking BC without runtime type inspection — do we really want to go that way?
There was a problem hiding this comment.
I don't think you can start sending Request instances into existing callbacks that expect str without breaking BC without runtime type inspection — do we really want to go that way?
You're right. Well, this may be a v2.0 material.
| if self._on_skipped_request: | ||
| try: | ||
| await self._on_skipped_request(url, reason) | ||
| await self._on_skipped_request(request, reason) |
There was a problem hiding this comment.
This now forwards request straight to the callback, but add_requests() (just below, ~line 841) wasn't updated — it still builds skipped from the original Sequence[str | Request] and passes a possibly-str item here:
for request in requests:
check_url = request.url if isinstance(request, Request) else request
if await self._is_allowed_based_on_robots_txt_file(check_url):
allowed_requests.append(request)
else:
skipped.append(request) # <- can be a plain strSo with respect_robots_txt_file=True, await crawler.add_requests(['https://disallowed/...']) on a robots-disallowed URL delivers a str to the callback, and request.url (as in the docs example) raises AttributeError → UserDefinedErrorHandlerError.
Suggest normalizing str → Request once at the choke point (here, or in add_requests) instead of only in the two extract_links impls. That also makes the isinstance(request, Request) guard a few lines up dead code (it now contradicts the request: Request annotation). This path also has no test coverage.
| requests = list[Request]() | ||
| skipped = list[Request]() | ||
|
|
||
| def create_request(request_options: RequestOptions) -> Request | None: |
There was a problem hiding this comment.
create_request is now duplicated verbatim (including the multi-line debug message) with PlaywrightCrawler.extract_links (_playwright_crawler.py:464). Both crawlers extend BasicCrawler, so this looks like a good candidate for a single shared helper in crawlee/_utils (taking a logger) to keep the two copies from drifting.
| context.log.debug( | ||
| f'Skipping URL "{request_options["url"]}" due to invalid format: {exc}. ' | ||
| 'This may be caused by a malformed URL or unsupported URL scheme. ' | ||
| 'Please ensure the URL is correct and retry.' |
There was a problem hiding this comment.
Now that this helper is also used for robots-skipped, auto-discovered links (not just user enqueues), the "Please ensure the URL is correct and retry." wording is misleading — the operator never submitted this URL and there's nothing to retry. Consider a neutral message, or a separate one for the skip path.
| else context.request.loaded_url or context.request.url | ||
| ) | ||
| links_iterator = to_absolute_url_iterator(base_url, links_iterator, logger=context.log) | ||
| skipped_iterator = iter([]) |
There was a problem hiding this comment.
Minor: skipped_iterator = iter([]) followed by a conditional reassignment inside if robots_txt_file: reads a little awkwardly (and iter([]) infers Iterator[Never]). The previous if/else, or guarding the skipped-building loop under the same if robots_txt_file:, is clearer.
| request = create_request(request_options) | ||
|
|
||
| if request is not None: | ||
| skipped.append(request) |
There was a problem hiding this comment.
Two things about this skipped-building loop:
-
Behavior change / silent drop: previously every robots-disallowed URL reached the callback as a raw string; now they go through
Request.from_url, which enforces http/https (validate_http_url). A disallowed-but-non-http(s) or malformed-but-absolute URL (one that passesto_absolute_url_iteratorbut failsfrom_url) is now silently dropped and the skip callback never fires for it. Intended? -
Consistency: these skipped requests are built directly and don't pass through
transform_request_function, while the enqueued ones do. So a skippedRequestcarries the baselabel/user_data/enqueue_strategybut not the user's per-request transform — inconsistent with enqueued requests from the same links.
| requests = list[Request]() | ||
| skipped = list[Request]() | ||
|
|
||
| def create_request(request_options: RequestOptions) -> Request | None: |
There was a problem hiding this comment.
Same as the HTTP crawler: this create_request is a verbatim duplicate (candidate for a shared helper), and the skipped-building loop below has the same silent-drop and transform_request_function-not-applied behavior. See the comments on _abstract_http_crawler.py.
|
|
||
| requests = [call.args[0] for call in skip.call_args_list] | ||
|
|
||
| all(isinstance(request, Request) for request in requests) |
|
|
||
| requests = [call.args[0] for call in skip.call_args_list] | ||
|
|
||
| all(isinstance(request, Request) for request in requests) |
|
|
||
| requests = [call.args[0] for call in skip.call_args_list] | ||
|
|
||
| all(isinstance(request, Request) for request in requests) |
|
Thanks for the review! Sorry for some obvious mistakes, I started to use crawlee only recently, clearly skill issue on my side. I'll incorporate your feedback very very soon! |
Description
This PR changes the skipped request handler to receive a
Requestobject instead of only the URLstr.Skipped URLs are now converted into
Requestobjects before the callback is invoked. This ensures that request metadata remains available for skipped requests, includingrequest.user_dataand other request attributes.Issues
Testing
Checklist
Future improvements
While working on this change, I noticed that many parts of the internal processing pipeline operate on
str | Requestunions.A possible future improvement could be to standardize on
Requestobjects internally and only acceptstr | Requestat the public API boundary. URLs could then be wrapped intoRequestinstances immediately, allowing the rest of the codebase to operate exclusively onRequestobjects.This could simplify typing, reduce repeated union handling, and make request metadata consistently available throughout the pipeline. However, such a change would require broader refactoring and additional testing, so it is outside the scope of this PR.